
Conversation

@imback82
Contributor

@imback82 imback82 commented Aug 2, 2020

What changes were proposed in this pull request?

This PR proposes to make the behavior of the `path` option consistent when loading dataframes with a single path (e.g., `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing the `path` option to coexist with `load()`'s path parameters.

Why are the changes needed?

The current behavior is inconsistent:

scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
                                                                                
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
|    1|
+-----+


scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
+-----+

Does this PR introduce any user-facing change?

Yes. Now, if the `path` option is specified along with `load()`'s path parameters, the call fails:

scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
                                                                                
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
  ... 47 elided

The user can restore the previous behavior by setting spark.sql.legacy.pathOptionBehavior.enabled to true.
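A hedged spark-shell sketch of the escape hatch described above (assuming the legacy config can be set at runtime; it can also be passed via `--conf` at launch):

```scala
// Restore the pre-3.1 behavior: allow the "path" option to coexist with
// load()'s path parameters (the old, inconsistent semantics return with it).
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", true)

// No longer throws AnalysisException.
spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show()
```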

How was this patch tested?

Added a test

@imback82
Contributor Author

imback82 commented Aug 2, 2020

I also see that the `path` option is treated differently depending on the number of paths passed to `load`. For example:

// For single path, "path" option is overridden with path given in "load"
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+                                                                         
|value|
+-----+
|    1|
+-----+

// For multiple paths, "path" option is "added" to the paths given in "load"
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test", "/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
|    1|
+-----+

Should I just fix this inconsistency instead of this PR? If so, what should be the expected behavior (just ignore path option if paths are passed to load or always add path to the given paths)?

cc: @cloud-fan @dongjoon-hyun @viirya (Thanks in advance!)

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126945 has finished for PR 29328 at commit 63f3d60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 imback82 changed the title [SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs Aug 2, 2020
@imback82
Contributor Author

imback82 commented Aug 3, 2020

FYI, the existing test "SPARK-11544 test pathfilter" fails if I change

assert(spark.read.options(extraOptions).json(path).count() === 2)

to

assert(spark.read.options(extraOptions).format("json").load(path).count() === 2)

We expect the same behavior between these two APIs, right?

@SparkQA

SparkQA commented Aug 3, 2020

Test build #126949 has finished for PR 29328 at commit 296d4bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft August 3, 2020 21:45
@SparkQA

SparkQA commented Aug 6, 2020

Test build #127120 has finished for PR 29328 at commit 17c94a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 6, 2020

Test build #127118 has finished for PR 29328 at commit 9b6be0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 6, 2020

Test build #127121 has finished for PR 29328 at commit 4c6ada7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 imback82 marked this pull request as ready for review August 6, 2020 19:30
@imback82 imback82 changed the title [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters Aug 6, 2020
@SparkQA

SparkQA commented Aug 6, 2020

Test build #127153 has finished for PR 29328 at commit cd15fe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2020

Test build #127346 has finished for PR 29328 at commit 808b7c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2020

Test build #127350 has finished for PR 29328 at commit 63c9383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM except one question: do we still need this fix? https://github.com/apache/spark/pull/29328/files#r468330253

@imback82
Contributor Author

imback82 commented Aug 13, 2020

LGTM except one question: do we still need this fix? https://github.com/apache/spark/pull/29328/files#r468330253

Yes, without it, the following tests fail:

 org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite.SPARK-23846: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.csv.CSVv1Suite.SPARK-23846: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.csv.CSVv2Suite.SPARK-23846: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.json.JsonLegacyTimeParserSuite.SPARK-11544 test pathfilter
 org.apache.spark.sql.execution.datasources.json.JsonLegacyTimeParserSuite.SPARK-23849: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.json.JsonV1Suite.SPARK-11544 test pathfilter
 org.apache.spark.sql.execution.datasources.json.JsonV1Suite.SPARK-23849: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.json.JsonV2Suite.SPARK-11544 test pathfilter
 org.apache.spark.sql.execution.datasources.json.JsonV2Suite.SPARK-23849: schema inferring touches less data if samplingRatio < 1.0

Btw, this is an existing bug. For example, for "SPARK-11544 test pathfilter", if I change `spark.read.options(extraOptions).json(path)` to `spark.read.options(extraOptions).option("path", path).format("json").load()`, it would fail. The changes in this PR are surfacing the existing issue.

@imback82
Contributor Author

@cloud-fan, should we also make the behavior the same on the save path? Currently, the "path" option will be overwritten if a path is passed to save. Let me know if we need to enforce the same check (no coexistence), and I can do it either in this PR or as a follow-up PR. Thanks!

@cloud-fan
Contributor

Btw, this is an existing bug.

Can we have a separate PR for this bug? Then we can backport the bug fix.

should we also make the behavior the same on the save path?

This is a good point. We never document the overwrite behavior anywhere and people may not realize the path option is overwritten by the save path parameter. Let's do it in a new PR.


- In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.

- In Spark 3.1, when loading a dataframe, `path` option cannot coexist with `load()`'s path parameters. For example, `spark.read.format("csv").option("path", "/tmp").load("/tmp2")` or `spark.read.option("path", "/tmp").csv("/tmp2")` will throw `org.apache.spark.sql.AnalysisException`. In Spark version 3.0 and below, `path` option is overwritten if one path parameter is passed to `load()`, or `path` option is added to the overall paths if multiple path parameters are passed to `load()`.
Member


Should paths be mentioned too?

Contributor Author

@imback82 imback82 Aug 20, 2020


I mentioned paths as well.


if ((extraOptions.contains("path") || extraOptions.contains("paths")) && paths.nonEmpty) {
throw new AnalysisException("There is a 'path' or 'paths' option set and load() is called " +
"with path parameters. Either remove the option or put it into the load() parameters.")
Member


Either remove the option or not put it into the load() parameters?

Contributor Author


I followed the suggestion from #29328 (comment), but let me know if the current message is still vague. Thanks!

Contributor


to make it clearer:

Either remove the path option if it's the same as the path parameter, or add it
to the load() parameter if you do want to read multiple paths.

paths = paths,
className = classOf[TextFileFormat].getName,
options = options.parameters
options = options.parameters - "path"
Member


Can you add some comments here at least? I feel it will confuse code readers later on why removing it...

@imback82 imback82 changed the title [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters [SPARK-32516][SQL] 'path' option cannot coexist with load()'s path parameters Aug 20, 2020
@SparkQA

SparkQA commented Aug 20, 2020

Test build #127671 has finished for PR 29328 at commit ac007c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 20, 2020

Test build #127674 has finished for PR 29328 at commit 62794e5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 20, 2020

Test build #127698 has finished for PR 29328 at commit d5c4203.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Sorry, I forgot about one thing: we need a legacy config to restore the previous behavior, in case users don't want to change their queries when upgrading Spark.

@imback82 can you add it?

// force invocation of `load(...varargs...)`
option("path", path).load(Seq.empty: _*)
if (sparkSession.sessionState.conf.legacyPathOptionBehavior) {
option("path", path).load(Seq.empty: _*)
Contributor Author


This is needed to restore the legacy behavior. Note that the legacy behavior is different for the following:

  • spark.read.option("path", path).parquet(path) => path is read twice
  • spark.read.format("parquet").option("path", path).load(path) => path is read once (option is overwritten)
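To spell out the two legacy code paths listed above, a sketch with the legacy flag enabled (behavior as described in this comment, not independently re-verified):

```scala
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", true)

// Convenience API: the "path" option and the parquet(path) argument are both
// kept, so the same file is read twice and rows are duplicated.
spark.read.option("path", "/tmp/test").parquet("/tmp/test").show()

// Generic API: load(path) overwrites the "path" option, so the file is read once.
spark.read.format("parquet").option("path", "/tmp/test").load("/tmp/test").show()
```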

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127789 has finished for PR 29328 at commit 92bf5ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127790 has finished for PR 29328 at commit e1decd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e3a88a9 Aug 24, 2020
@cloud-fan
Contributor

@imback82 please open a PR to do the same check for DataFrameWriter, DataStreamReader/Writer, thanks!

@imback82
Contributor Author

Yes, will do. Thanks for the review!

cloud-fan pushed a commit that referenced this pull request Aug 27, 2020
…arameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()

### What changes were proposed in this pull request?

This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if the `path` option coexists with a path parameter passed to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a  path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Sep 11, 2020
…der/Writer should not fail

### What changes were proposed in this pull request?

This is a followup of #29328

In #29328, we forbade the use case where both the path option and a path parameter are specified. However, it breaks some use cases:
```
val dfr =  spark.read.format(...).option(...)
dfr.load(path1).xxx
dfr.load(path2).xxx
```

The reason is that `load` has side effects: it sets the path option on the `DataFrameReader` instance. The next time you call `load`, Spark fails because both the path option and a path parameter are specified.

This PR removes the side effect of `save`/`load`/`start` so that they no longer set the path option.
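The reader-reuse pattern the follow-up restores can be sketched as follows (the `mergeSchema` option here is purely illustrative):

```scala
// Build one reusable reader...
val dfr = spark.read.format("parquet").option("mergeSchema", "true")

// ...and call load() more than once. After this follow-up, load(path) no
// longer stores "path" as an option on dfr, so the second call succeeds
// instead of tripping the coexistence check.
val df1 = dfr.load("/tmp/path1")
val df2 = dfr.load("/tmp/path2")
```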

### Why are the changes needed?

recover some use cases

### Does this PR introduce _any_ user-facing change?

Yes, some use cases fail before this PR, and can run successfully after this PR.

### How was this patch tested?

new tests

Closes #29723 from cloud-fan/df.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>